Recommender Systems using Affinity Analysis


Here we will look at affinity analysis, which determines when objects occur together frequently. This is also called market basket analysis, after one of its use cases: determining which items are frequently purchased together. In this example, we wish to work out when two movies are recommended by the same reviewers.

Affinity analysis

Affinity analysis is the task of determining when objects are used in similar ways. The data for affinity analysis is often described in the form of a transaction. Intuitively, this comes from a transaction at a store—determining when objects are purchased together.

The classic algorithm for affinity analysis is called the Apriori algorithm. It addresses the exponential problem of creating sets of items that occur frequently within a database, called frequent itemsets. Once these frequent itemsets are discovered, creating association rules is straightforward.

Apriori algorithm

First, we need to ensure that an itemset has sufficient support within the dataset. The minimum support level is the key parameter for Apriori. To build a frequent itemset, for an itemset (A, B) to have a support of at least 30, both A and B must occur at least 30 times in the database. This property extends to larger sets as well: for an itemset (A, B, C, D) to be considered frequent, the set (A, B, C) must also be frequent (as must D). Frequent itemsets can therefore be built up incrementally, and candidate itemsets that cannot be frequent (of which there are many) are never tested. This saves significant time compared with testing every possible itemset. Other algorithms for affinity analysis include the Eclat and FP-growth algorithms, and there are many improvements to these algorithms in the data mining literature that further improve their efficiency. In this chapter, we will focus on the basic Apriori algorithm.
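To make this concrete, the following is a minimal sketch (using a handful of toy transactions, not the MovieLens data used later) showing how support counting and this pruning property work together: an item that is not frequent on its own, such as eggs below, is never combined into any larger candidate itemset.

from itertools import combinations

# Toy transactions (made up for illustration): each set is one "basket" of items
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
min_support = 3  # an itemset must appear in at least 3 transactions

def support(itemset):
    # Number of transactions containing every item in the itemset
    return sum(1 for transaction in transactions if itemset <= transaction)

# Step 1: frequent single items ("eggs" appears only once, so it is dropped here)
items = {item for transaction in transactions for item in transaction}
frequent_1 = {frozenset({item}) for item in items if support({item}) >= min_support}
print(frequent_1)

# Step 2: candidate pairs are built only from the frequent single items, so no
# pair containing "eggs" is ever generated, let alone counted
candidates_2 = {a | b for a, b in combinations(frequent_1, 2)}
frequent_2 = {pair for pair in candidates_2 if support(pair) >= min_support}
print(frequent_2)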

Choosing parameters

To perform association rule mining for affinity analysis, we first use the Apriori algorithm to generate frequent itemsets. Next, we create association rules (for example, if a person recommended movie X, they would also recommend movie Y) by testing combinations of premises and conclusions within those frequent itemsets. For the first stage, the Apriori algorithm needs a value for the minimum support that an itemset needs in order to be considered frequent. Any itemsets with lower support will not be considered. Setting this minimum support too low will cause Apriori to test a larger number of itemsets, slowing the algorithm down. Setting it too high will result in fewer itemsets being considered frequent.

In the second stage, after the frequent itemsets have been discovered, association rules are tested based on their confidence. We could choose a minimum confidence level, a number of rules to return, or simply return all of them and let the user decide what to do with them.

Here, we will return only rules above a given confidence level. Therefore, we need to set our minimum confidence level. Setting this too low will result in rules that have a high support, but are not very accurate. Setting this higher will result in only more accurate rules being returned, but with fewer rules being discovered.
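As a rough worked example with made-up numbers (not drawn from the MovieLens data): the confidence of a rule is the number of times the premise and conclusion occur together, divided by the number of times the premise occurs. So a rule whose premise occurs 80 times, and whose premise and conclusion occur together 60 times, has a confidence of 0.75; it would survive a minimum confidence of 0.7 but be discarded at 0.9.

# Hypothetical counts for a single rule "if X is recommended, so is Y"
support_premise = 80                  # users who recommended X
support_premise_and_conclusion = 60   # users who recommended both X and Y

confidence = support_premise_and_conclusion / float(support_premise)
print(confidence)                     # 0.75

min_confidence = 0.9
print(confidence >= min_confidence)   # False, so this rule would be discarded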

The movie recommendation problem

Product recommendation is big business. Online stores use it to up-sell to customers by recommending other products that they could buy. Making better recommendations leads to better sales. With online stores selling to millions of customers every year, there is a lot of potential money to be made by selling them more items.

Product recommendations have been researched for many years; however, the field gained a significant boost when Netflix ran their Netflix Prize between 2007 and 2009. This competition aimed to determine whether anyone could predict a user's rating of a film better than Netflix was currently doing. The prize went to a team that was just over 10 percent better than the existing solution. While this may not seem like a large improvement, such a gain would be worth millions to Netflix in revenue from better movie recommendations.

Obtaining the dataset

Since the inception of the Netflix Prize, GroupLens, a research group at the University of Minnesota, has released several datasets that are often used for testing algorithms in this area. They have released several versions of a movie rating dataset of different sizes: one with 100,000 reviews, one with 1 million reviews, and one with 10 million reviews. The datasets are available from http://grouplens.org/datasets/movielens/ and the dataset we are going to use in this chapter is the MovieLens 100,000-review (ml-100k) dataset. Download this dataset and unzip it in your data folder. We then load the dataset using pandas.

The MovieLens dataset is in good shape; however, there are some changes from the default options of pandas.read_csv that we need to make. To start with, the data is separated by tabs, not commas. Next, there is no heading line, which means the first line in the file is actually data and we need to set the column names manually. When loading the file, we set the delimiter parameter to the tab character, tell pandas not to read the first row as the header (with header=None), and set the column names.


In [2]:
ratings_filename = "data/ml-100k/u.data"

In [3]:
import pandas as pd

In [4]:
all_ratings = pd.read_csv(ratings_filename, delimiter="\t", header=None, names = ["UserID", "MovieID", "Rating", "Datetime"])
all_ratings["Datetime"] = pd.to_datetime(all_ratings['Datetime'],unit='s')
all_ratings[:5]


Out[4]:
UserID MovieID Rating Datetime
0 196 242 3 1997-12-04 15:55:49
1 186 302 3 1998-04-04 19:22:22
2 22 377 1 1997-11-07 07:18:36
3 244 51 2 1997-11-27 05:02:03
4 166 346 1 1998-02-02 05:33:16

Sparse data formats

This dataset is in a sparse format. Each row can be thought of as a cell in a large feature matrix of the type used in previous chapters, where rows are users and columns are individual movies. The first column would be each user's review of the first movie, the second column would be each user's review of the second movie, and so on. There are roughly 1,000 users and 1,700 movies in this dataset, which means that the full matrix would be quite large. We may run into issues storing the whole matrix in memory, and computing on it would be troublesome. However, this matrix has the property that most cells are empty, that is, for most combinations of user and movie there is no review. For example, there is no review of movie #213 by user #675, nor for most other combinations of user and movie.


In [5]:
# As you can see, user #675 has no review for most movies, such as movie #213
all_ratings[all_ratings["UserID"] == 675].sort_values("MovieID")


Out[5]:
UserID MovieID Rating Datetime
81098 675 86 4 1998-03-10 00:26:14
90696 675 223 1 1998-03-10 00:35:51
92650 675 235 1 1998-03-10 00:35:51
95459 675 242 4 1998-03-10 00:08:42
82845 675 244 3 1998-03-10 00:29:35
53293 675 258 3 1998-03-10 00:11:19
97286 675 269 5 1998-03-10 00:08:07
93720 675 272 3 1998-03-10 00:07:11
73389 675 286 4 1998-03-10 00:07:11
77524 675 303 5 1998-03-10 00:08:42
47367 675 305 4 1998-03-10 00:09:08
44300 675 306 5 1998-03-10 00:08:07
53730 675 311 3 1998-03-10 00:10:47
54284 675 312 2 1998-03-10 00:10:24
63291 675 318 5 1998-03-10 00:21:13
87082 675 321 2 1998-03-10 00:11:48
56108 675 344 4 1998-03-10 00:12:34
53046 675 347 4 1998-03-10 00:07:11
94617 675 427 5 1998-03-10 00:28:11
69915 675 463 5 1998-03-10 00:16:43
46744 675 509 5 1998-03-10 00:24:25
46598 675 531 5 1998-03-10 00:18:28
52962 675 650 5 1998-03-10 00:32:51
94029 675 750 4 1998-03-10 00:08:07
53223 675 874 4 1998-03-10 00:11:19
62277 675 891 2 1998-03-10 00:12:59
77274 675 896 5 1998-03-10 00:09:35
66194 675 900 4 1998-03-10 00:10:24
54994 675 937 1 1998-03-10 00:35:51
61742 675 1007 4 1998-03-10 00:25:22
49225 675 1101 4 1998-03-10 00:33:49
50692 675 1255 1 1998-03-10 00:35:51
74202 675 1628 5 1998-03-10 00:30:37
47866 675 1653 5 1998-03-10 00:31:53

The format given here represents the full matrix, but in a more compact way. The first row indicates that user #196 reviewed movie #242, giving it a rating of 3 (out of five) on December 4, 1997. Any combination of user and movie that isn't in this database is assumed not to exist. This saves significant space compared with storing a large number of zeroes in memory. This type of format is called a sparse matrix format. As a rule of thumb, if you expect about 60 percent or more of your dataset to be empty or zero, a sparse format will take less space to store.

When computing on sparse matrices, the focus isn't usually on the data we don't have—comparing all of the zeroes. We usually focus on the data we have and compare those.
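As an optional aside (not part of this chapter's workflow, and assuming scipy is installed), we can check just how sparse this dataset is by comparing the number of ratings we actually have against the number of cells the dense user-by-movie matrix would contain, and we can build that sparse matrix directly from the columns we loaded:

# Optional sketch: measure the sparsity of the ratings and build a sparse matrix.
# Assumes scipy is installed; it is not needed anywhere else in this chapter.
from scipy.sparse import coo_matrix

n_users = all_ratings["UserID"].max() + 1
n_movies = all_ratings["MovieID"].max() + 1

# Fraction of the dense user-by-movie matrix that actually holds a rating
density = len(all_ratings) / float(n_users * n_movies)
print("Only {:.1%} of the matrix cells contain a rating".format(density))

# The sparse matrix stores just the cells we have, one entry per review
ratings_matrix = coo_matrix((all_ratings["Rating"],
                             (all_ratings["UserID"], all_ratings["MovieID"])),
                            shape=(n_users, n_movies))
print(ratings_matrix.shape, ratings_matrix.nnz)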

The Apriori implementation

The goal of this chapter is to produce rules of the following form: if a person recommends these movies, they will also recommend this movie. We will also discuss extensions of this, where a person who recommends a set of movies is likely to recommend another particular movie.

To do this, we first need to determine whether a person recommends a movie. We can do this by creating a new feature, Favorable, which is True if the person gave a favorable review (a rating higher than 3) to a movie:


In [10]:
# Not all reviews are favourable! Our goal is "other recommended movies", so we only want favourable reviews
all_ratings["Favorable"] = all_ratings["Rating"] > 3
all_ratings[10:15]


Out[10]:
UserID MovieID Rating Datetime Favorable
10 62 257 2 1997-11-12 22:07:14 False
11 286 1014 5 1997-11-17 15:38:45 True
12 200 222 5 1997-10-05 09:05:40 True
13 210 40 3 1998-03-27 21:59:54 False
14 224 29 3 1998-02-21 23:40:57 False

In [11]:
all_ratings[all_ratings["UserID"] == 1][:5]


Out[11]:
UserID MovieID Rating Datetime Favorable
202 1 61 4 1997-11-03 07:33:40 True
305 1 189 3 1998-03-01 06:15:28 False
333 1 33 4 1997-11-03 07:38:19 True
334 1 160 4 1997-09-24 03:42:27 True
478 1 20 4 1998-02-14 04:51:23 True

We will sample our dataset to form a training dataset. This also helps reduce the size of the dataset that will be searched, making the Apriori algorithm run faster. We obtain all reviews from the users with an ID lower than 200:


In [12]:
# Sample the dataset. You can try increasing the size of the sample, but the run time will be considerably longer
ratings = all_ratings[all_ratings['UserID'].isin(range(200))]

Next, we can create a dataset of only the favorable reviews in our sample:


In [13]:
# We start by creating a dataset of each user's favourable reviews
favorable_ratings = ratings[ratings["Favorable"]]
favorable_ratings[:5]


Out[13]:
UserID MovieID Rating Datetime Favorable
16 122 387 5 1997-11-11 17:47:39 True
20 119 392 4 1998-01-30 16:13:34 True
21 167 486 4 1998-04-16 14:54:12 True
26 38 95 5 1998-04-13 01:14:54 True
28 63 277 4 1997-10-01 23:10:01 True

We will be searching the users' favorable reviews for our itemsets. So, the next thing we need is the set of movies that each user has given a favorable review. We can compute this by grouping the dataset by UserID and collecting the movies in each group:


In [14]:
# Build a dictionary mapping each user to the frozenset of movies they reviewed favourably
favorable_reviews_by_users = dict((k, frozenset(v.values)) for k, v in favorable_ratings.groupby("UserID")["MovieID"])
len(favorable_reviews_by_users)


Out[14]:
199

In the preceding code, we stored the values as a frozenset, allowing us to quickly check whether a movie has been reviewed favorably by a given user. Sets are much faster than lists for this type of operation, and we will use them in later code. Finally, we can create a DataFrame that tells us how many favorable reviews each movie has received:


In [15]:
# Find out how many favourable reviews each movie has
num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()
num_favorable_by_movie.sort_values("Favorable", ascending=False)[:5]


Out[15]:
Favorable
MovieID
50 100.0
100 89.0
258 83.0
181 79.0
174 74.0

The Apriori algorithm revisited

The Apriori algorithm is part of our affinity analysis and deals specifically with finding frequent itemsets within the data. The basic procedure of Apriori builds up new candidate itemsets from previously discovered frequent itemsets. These candidates are tested to see if they are frequent, and then the algorithm iterates as explained here:

  1. Create initial frequent itemsets by placing each item in its own itemset. Only items with at least the minimum support are used in this step.
  2. New candidate itemsets are created from the most recently discovered frequent itemsets by finding supersets of the existing frequent itemsets.
  3. All candidate itemsets are tested to see if they are frequent. If a candidate is not frequent then it is discarded. If there are no new frequent itemsets from this step, go to the last step.
  4. Store the newly discovered frequent itemsets and go to the second step.
  5. Return all of the discovered frequent itemsets.

Implementation

On the first iteration of Apriori, the newly discovered itemsets will have a length of 2, as they will be supersets of the initial itemsets created in the first step. On the second iteration (after applying the fourth step), the newly discovered itemsets will have a length of 3. This allows us to quickly identify the newly discovered itemsets, as needed in the second step.

We can store our discovered frequent itemsets in a dictionary, where the key is the length of the itemsets. This allows us to quickly access the itemsets of a given length, and therefore the most recently discovered frequent itemsets, with the help of the following code:


In [17]:
frequent_itemsets = {}  # frequent itemsets, keyed by the length of the itemset

We also need to define the minimum support needed for an itemset to be considered frequent. This value is chosen based on the dataset, but feel free to try different values. I recommend only changing it by about 10 percent at a time, though, as the algorithm's running time will change significantly! Let's set the minimum support:


In [18]:
min_support = 50

To implement the first step of the Apriori algorithm, we create an itemset with each movie individually and test if the itemset is frequent. We use frozenset, as they allow us to perform set operations later on, and they can also be used as keys in our counting dictionary (normal sets cannot).


In [19]:
# k=1 candidates are the movies with more than min_support favourable reviews
frequent_itemsets[1] = dict((frozenset((movie_id,)), row["Favorable"])
                                for movie_id, row in num_favorable_by_movie.iterrows()
                                if row["Favorable"] > min_support)

We implement the second and third steps together for efficiency by creating a function that takes the newly discovered frequent itemsets, creates the supersets, and then tests whether they are frequent. First, we set up the function and the counting dictionary. In keeping with our rule of thumb of reading through the data as little as possible, we iterate over the dataset once per call to this function. While this doesn't matter too much in this implementation (our dataset is relatively small), it is a good practice to get into for larger applications. We iterate over all of the users and their reviews. Next, we go through each of the previously discovered itemsets and see if it is a subset of the current set of reviews. If it is, this means that the user has favorably reviewed each movie in the itemset. We can then go through each individual movie that the user has reviewed that isn't in the itemset, create a superset from it, and record in our counting dictionary that we saw this particular itemset. We end our function by testing which of the candidate itemsets have enough support to be considered frequent, and return only those:


In [21]:
from collections import defaultdict

def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    # Iterate over each user and the movies they reviewed favourably
    for user, reviews in favorable_reviews_by_users.items():
        # For each previous frequent itemset, check whether the user favourably reviewed all of its movies
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                # Extend the itemset with each other movie this user reviewed favourably
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    # Keep only the candidate supersets that reach the minimum support
    return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support])

To run our code, we create a loop that iterates over the steps of the Apriori algorithm, storing the new itemsets as we go. In this loop, k represents the length of the soon-to-be-discovered frequent itemsets, allowing us to access the most recently discovered ones by looking in our frequent_itemsets dictionary using the key k - 1. We create the frequent itemsets and store them in our dictionary by their length. We break out of the loop if we didn't find any new frequent itemsets (and print a message to let us know what is going on). If we do find frequent itemsets, we print out a message to let us know the loop will be running again. This algorithm can take a while to run, so it is helpful to know that the code is still running while you wait for it to complete! Finally, after the end of the loop, we are no longer interested in the first set of itemsets: these are itemsets of length one, which won't help us create association rules, as we need at least two items to form a rule. Let's delete them:


In [22]:
import sys

print("There are {} movies with more than {} favorable reviews".format(len(frequent_itemsets[1]), min_support))
sys.stdout.flush()
for k in range(2, 20):
    # Generate candidates of length k, using the frequent itemsets of length k-1
    # Only store the frequent itemsets
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k-1],
                                                   min_support)
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break
    else:
        print("I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k))
        #print(cur_frequent_itemsets)
        sys.stdout.flush()
        frequent_itemsets[k] = cur_frequent_itemsets
# We aren't interested in the itemsets of length 1, so remove those
del frequent_itemsets[1]


There are 16 movies with more than 50 favorable reviews
I found 93 frequent itemsets of length 2
I found 295 frequent itemsets of length 3
I found 593 frequent itemsets of length 4
I found 785 frequent itemsets of length 5
I found 677 frequent itemsets of length 6
I found 373 frequent itemsets of length 7
I found 126 frequent itemsets of length 8
I found 24 frequent itemsets of length 9
I found 2 frequent itemsets of length 10
Did not find any frequent itemsets of length 11

This code may take a few minutes to run.


In [23]:
print("Found a total of {0} frequent itemsets".format(sum(len(itemsets) for itemsets in frequent_itemsets.values())))


Found a total of 2968 frequent itemsets

As we can see, it returns 2,968 frequent itemsets of varying lengths. You'll notice that the number of itemsets grows as the length increases before it shrinks. It grows because of the increasing number of ways to combine the frequent itemsets. After a while, these larger combinations no longer have the support necessary to be considered frequent, so the number shrinks. This shrinking is the benefit of the Apriori algorithm. If we searched all possible itemsets (not just the supersets of frequent ones), we would be testing thousands of times more itemsets to see whether they are frequent.
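To get a sense of how much work that pruning saves, the following back-of-the-envelope sketch (not part of the chapter's workflow) compares the number of frequent itemsets Apriori actually found against the number of possible itemsets of the same lengths that could be formed from every movie in the dataset:

# Rough comparison: itemsets Apriori found versus all possible itemsets
from math import factorial

def n_choose_k(n, k):
    # Number of distinct itemsets of size k that can be drawn from n items
    return factorial(n) // (factorial(k) * factorial(n - k))

n_movies = all_ratings["MovieID"].nunique()
possible_itemsets = sum(n_choose_k(n_movies, k) for k in range(2, 11))
print("Possible itemsets of lengths 2 to 10: {:,}".format(possible_itemsets))
print("Frequent itemsets Apriori actually found: {:,}".format(
    sum(len(itemsets) for itemsets in frequent_itemsets.values())))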

Extracting association rules

After the Apriori algorithm has completed, we have a list of frequent itemsets. These aren't exactly association rules, but they can easily be converted into them. A frequent itemset is a set of items with a minimum support, while an association rule has a premise and a conclusion.

We can make an association rule from a frequent itemset by taking one of the movies in the itemset and denoting it as the conclusion. The other movies in the itemset will be the premise. This will form rules of the following form: if a reviewer recommends all of the movies in the premise, they will also recommend the conclusion.

For each itemset, we can generate a number of association rules by setting each movie to be the conclusion and the remaining movies as the premise.

In code, we first generate a list of all of the rules from each of the frequent itemsets, by iterating over each of the discovered frequent itemsets of each length. We then iterate over every movie in this itemset, using it as our conclusion. The remaining movies in the itemset are the premise. We save the premise and conclusion as our candidate rule. This returns a very large number of candidate rules. We can see some by printing out the first few rules in the list.


In [24]:
# Now we create the association rules. First, they are candidates until the confidence has been tested
candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
print("There are {} candidate rules".format(len(candidate_rules)))


There are 15285 candidate rules

In [25]:
print(candidate_rules[:5])


[(frozenset([50]), 64), (frozenset([64]), 50), (frozenset([127]), 181), (frozenset([181]), 127), (frozenset([127]), 1)]

In these rules, the first part (the frozenset) is the set of movies in the premise, while the number after it is the conclusion. In the first case, if a reviewer recommends movie 50, they are also likely to recommend movie 64.

Next, we compute the confidence of each of these rules. The process starts by creating dictionaries to store how many times we see the premise leading to the conclusion (a correct example of the rule) and how many times it doesn't (an incorrect example). We iterate over all of the users, their favorable reviews, and over each candidate association rule. We then test to see if the premise is applicable to this user. In other words, did the user favorably review all of the movies in the premise? If the premise applies, we see if the conclusion movie was also rated favorably. If so, the rule is correct in this instance. If not, it is incorrect. We then compute the confidence for each rule by dividing the correct count by the total number of times the rule was seen.


In [27]:
# Now, we compute the confidence of each of these rules. This is very similar to what we did in chapter 1
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
rule_confidence = {candidate_rule: 
                   correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                    for candidate_rule in candidate_rules}

In [28]:
# Choose only rules above a minimum confidence level
min_confidence = 0.9

In [29]:
# Filter out the rules with poor confidence
rule_confidence = {rule: confidence for rule, confidence in rule_confidence.items() if confidence > min_confidence}
print(len(rule_confidence))


5152

Now we can print the top five rules by sorting this confidence dictionary and printing the results:


In [30]:
from operator import itemgetter
sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)

In [31]:
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    print("Rule: If a person recommends {0} they will also recommend {1}".format(premise, conclusion))
    print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
    print("")


Rule #1
Rule: If a person recommends frozenset([64, 56, 50, 98, 7]) they will also recommend 174
 - Confidence: 1.000

Rule #2
Rule: If a person recommends frozenset([98, 100, 7, 172, 50, 181]) they will also recommend 56
 - Confidence: 1.000

Rule #3
Rule: If a person recommends frozenset([64, 98, 100, 7, 172, 50]) they will also recommend 56
 - Confidence: 1.000

Rule #4
Rule: If a person recommends frozenset([64, 98, 100, 50, 56, 127]) they will also recommend 174
 - Confidence: 1.000

Rule #5
Rule: If a person recommends frozenset([98, 100, 172, 79, 50, 56]) they will also recommend 7
 - Confidence: 1.000

The resulting printout shows only the movie IDs, which isn't very helpful without the names of the movies as well. The dataset came with a file called u.item, which stores the movie names and their corresponding MovieID (as well as other information, such as the genre).

We can load the titles from this file using pandas. Additional information about the file and categories is available in the README that came with the dataset. The data in this file is in CSV format, but with values separated by the | symbol; it has no header, and the encoding is important to set. The column names were found in the README file.


In [34]:
# Even better, we can get the movie titles themselves from the dataset
movie_name_filename = 'data/ml-100k/u.item'
movie_name_data = pd.read_csv(movie_name_filename, delimiter="|", header=None, encoding = "mac-roman")
movie_name_data.columns = ["MovieID", "Title", "Release Date", "Video Release", "IMDB", "<UNK>", "Action", "Adventure",
                           "Animation", "Children's", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir",
                           "Horror", "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]

Getting the movie title is important, so we will create a function that will return a movie's title from its MovieID, saving us the trouble of looking it up each time. We look up the given MovieID in the movie_name_data DataFrame and return only the title column. We use the values attribute to get the actual value (and not the pandas Series object that is currently stored in title_object). We are only interested in the first value—there should only be one title for a given MovieID anyway! We end the function by returning the title as needed.


In [35]:
def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data["MovieID"] == movie_id]["Title"]
    title = title_object.values[0]
    return title

In [36]:
get_movie_name(4)


Out[36]:
u'Get Shorty (1995)'

We adjust our previous code for printing out the top rules to also include the titles:


In [38]:
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name))
    print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
    print("")


Rule #1
Rule: If a person recommends Shawshank Redemption, The (1994), Pulp Fiction (1994), Star Wars (1977), Silence of the Lambs, The (1991), Twelve Monkeys (1995) they will also recommend Raiders of the Lost Ark (1981)
 - Confidence: 1.000

Rule #2
Rule: If a person recommends Silence of the Lambs, The (1991), Fargo (1996), Twelve Monkeys (1995), Empire Strikes Back, The (1980), Star Wars (1977), Return of the Jedi (1983) they will also recommend Pulp Fiction (1994)
 - Confidence: 1.000

Rule #3
Rule: If a person recommends Shawshank Redemption, The (1994), Silence of the Lambs, The (1991), Fargo (1996), Twelve Monkeys (1995), Empire Strikes Back, The (1980), Star Wars (1977) they will also recommend Pulp Fiction (1994)
 - Confidence: 1.000

Rule #4
Rule: If a person recommends Shawshank Redemption, The (1994), Silence of the Lambs, The (1991), Fargo (1996), Star Wars (1977), Pulp Fiction (1994), Godfather, The (1972) they will also recommend Raiders of the Lost Ark (1981)
 - Confidence: 1.000

Rule #5
Rule: If a person recommends Silence of the Lambs, The (1991), Fargo (1996), Empire Strikes Back, The (1980), Fugitive, The (1993), Star Wars (1977), Pulp Fiction (1994) they will also recommend Twelve Monkeys (1995)
 - Confidence: 1.000

The result is much more readable (there are still some issues, but we can ignore them for now).

Evaluation

In a broad sense, we can evaluate the association rules using the same concept as for classification. We use a test set of data that was not used for training, and evaluate our discovered rules based on their performance in this test set.

To do this, we will compute the test set confidence, that is, the confidence of each rule on the testing set. We won't apply a formal evaluation metric in this case; we simply examine the rules and look for good examples.

First, we extract the test dataset, which is all of the records we didn't use in the training set. We used the users with an ID lower than 200 for the training set, and we will use all of the rest for the testing dataset. As with the training set, we will also get the favorable reviews for each of the users in this dataset.


In [39]:
# Evaluation using test data
test_dataset = all_ratings[~all_ratings['UserID'].isin(range(200))]
test_favorable = test_dataset[test_dataset["Favorable"]]
test_favorable_by_users = dict((k, frozenset(v.values)) for k, v in test_favorable.groupby("UserID")["MovieID"])

In [40]:
test_dataset[:5]


Out[40]:
UserID MovieID Rating Datetime Favorable
3 244 51 2 1997-11-27 05:02:03 False
5 298 474 4 1998-01-07 14:20:06 True
7 253 465 5 1998-04-03 18:34:27 True
8 305 451 3 1998-02-01 09:20:17 False
11 286 1014 5 1997-11-17 15:38:45 True

We then count the correct instances where the premise leads to the conclusion, in the same way we did before. The only change here is the use of the test data instead of the training data.


In [41]:
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1

Next, we compute the confidence of each rule from the correct counts.


In [42]:
# Compute the test-set confidence only for the rules that passed the training confidence filter
test_confidence = {candidate_rule: correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in rule_confidence}
print(len(test_confidence))


5152

In [44]:
sorted_test_confidence = sorted(test_confidence.items(), key=itemgetter(1), reverse=True)
print(sorted_test_confidence[:5])


[((frozenset([64, 1, 7, 172, 79, 50]), 174), 1.0), ((frozenset([64, 98, 7, 258, 174, 181]), 172), 1.0), ((frozenset([64, 1, 98, 7, 79, 181, 56]), 174), 1.0), ((frozenset([64, 1, 98, 172, 79, 181, 56]), 174), 1.0), ((frozenset([64, 1, 98, 7, 172, 79, 181]), 174), 1.0)]

Finally, we print out the best association rules with the titles instead of the movie IDs.


In [45]:
for index in range(10):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name))
    print(" - Train Confidence: {0:.3f}".format(rule_confidence.get((premise, conclusion), -1)))
    print(" - Test Confidence: {0:.3f}".format(test_confidence.get((premise, conclusion), -1)))
    print("")


Rule #1
Rule: If a person recommends Shawshank Redemption, The (1994), Pulp Fiction (1994), Star Wars (1977), Silence of the Lambs, The (1991), Twelve Monkeys (1995) they will also recommend Raiders of the Lost Ark (1981)
 - Train Confidence: 1.000
 - Test Confidence: 0.909

Rule #2
Rule: If a person recommends Silence of the Lambs, The (1991), Fargo (1996), Twelve Monkeys (1995), Empire Strikes Back, The (1980), Star Wars (1977), Return of the Jedi (1983) they will also recommend Pulp Fiction (1994)
 - Train Confidence: 1.000
 - Test Confidence: 0.811

Rule #3
Rule: If a person recommends Shawshank Redemption, The (1994), Silence of the Lambs, The (1991), Fargo (1996), Twelve Monkeys (1995), Empire Strikes Back, The (1980), Star Wars (1977) they will also recommend Pulp Fiction (1994)
 - Train Confidence: 1.000
 - Test Confidence: 0.824

Rule #4
Rule: If a person recommends Shawshank Redemption, The (1994), Silence of the Lambs, The (1991), Fargo (1996), Star Wars (1977), Pulp Fiction (1994), Godfather, The (1972) they will also recommend Raiders of the Lost Ark (1981)
 - Train Confidence: 1.000
 - Test Confidence: 0.848

Rule #5
Rule: If a person recommends Silence of the Lambs, The (1991), Fargo (1996), Empire Strikes Back, The (1980), Fugitive, The (1993), Star Wars (1977), Pulp Fiction (1994) they will also recommend Twelve Monkeys (1995)
 - Train Confidence: 1.000
 - Test Confidence: 0.609

Rule #6
Rule: If a person recommends Shawshank Redemption, The (1994), Silence of the Lambs, The (1991), Fargo (1996), Twelve Monkeys (1995), Empire Strikes Back, The (1980), Star Wars (1977) they will also recommend Raiders of the Lost Ark (1981)
 - Train Confidence: 1.000
 - Test Confidence: 0.971

Rule #7
Rule: If a person recommends Shawshank Redemption, The (1994), Toy Story (1995), Twelve Monkeys (1995), Empire Strikes Back, The (1980), Fugitive, The (1993), Star Wars (1977) they will also recommend Return of the Jedi (1983)
 - Train Confidence: 1.000
 - Test Confidence: 0.900

Rule #8
Rule: If a person recommends Toy Story (1995), Silence of the Lambs, The (1991), Fargo (1996), Raiders of the Lost Ark (1981), Godfather, The (1972) they will also recommend Pulp Fiction (1994)
 - Train Confidence: 1.000
 - Test Confidence: 0.750

Rule #9
Rule: If a person recommends Silence of the Lambs, The (1991), Godfather, The (1972), Empire Strikes Back, The (1980), Raiders of the Lost Ark (1981), Twelve Monkeys (1995) they will also recommend Shawshank Redemption, The (1994)
 - Train Confidence: 1.000
 - Test Confidence: 0.854

Rule #10
Rule: If a person recommends Pulp Fiction (1994), Toy Story (1995), Shawshank Redemption, The (1994), Godfather, The (1972) they will also recommend Silence of the Lambs, The (1991)
 - Train Confidence: 1.000
 - Test Confidence: 0.870

The fifth rule, for instance, has a perfect confidence in the training data (1.000), but it is only accurate in about 60 percent of cases on the test data (0.609). Many of the other rules in the top 10 have high confidence on the test data though, making them good rules for making recommendations.

Summary

In this example, we performed affinity analysis in order to recommend movies based on a large set of reviewers. We did this in two stages. First, we found frequent itemsets in the data using the Apriori algorithm. Then, we created association rules from those itemsets.

The use of the Apriori algorithm was necessary due to the size of the dataset. We performed training on a subset of our data in order to find the association rules, and then tested those rules on the rest of the data—a testing set. From what we discussed in the previous chapters, we could extend this concept to use cross-fold validation to better evaluate the rules. This would lead to a more robust evaluation of the quality of each rule.
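A minimal sketch of what that cross-fold validation could look like is shown below. This is not code from the chapter; it assumes numpy is available and that all_ratings (with its Favorable column) has been loaded as before. It only sets up the folds: inside the loop, the same Apriori run, rule extraction, and confidence calculation from earlier would be repeated per fold, and the test confidences averaged across folds.

# Sketch: cross-fold validation of the association rules (assumes numpy and
# the dataframes defined earlier in the chapter are available)
import numpy as np

user_ids = all_ratings["UserID"].unique()
np.random.seed(14)          # make the fold assignment reproducible
np.random.shuffle(user_ids)
folds = np.array_split(user_ids, 5)

for fold_index, test_users in enumerate(folds):
    test_mask = all_ratings["UserID"].isin(test_users)
    train_favorable = all_ratings[~test_mask & all_ratings["Favorable"]]
    test_favorable = all_ratings[test_mask & all_ratings["Favorable"]]
    # From here, repeat the earlier steps per fold: build the per-user
    # favourable-review sets from train_favorable, run the Apriori loop,
    # extract candidate rules, and compute each rule's confidence on
    # test_favorable, before averaging the test confidences across folds.
    print("Fold {}: {} training reviews, {} test reviews".format(
        fold_index, len(train_favorable), len(test_favorable)))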